Abstract: The successive over-relaxation (SOR) iterative method is an
important solver for linear systems. In this paper, a parallel
algorithm for the red-black SOR method with domain decomposition is
investigated. The parallel SOR algorithm is designed by combining the
traditional red-black SOR and the row block domain decomposition
technique, which reduces the communication cost and simplifies the
parallel implementation. Two other iterative methods, Jacobi and
Gauss-Seidel (G-S), are also implemented in parallel for comparison.
The three parallel iterative algorithms are implemented in C and MPI
(Message Passing Interface) for solving the Dirichlet problem on a
Linux cluster with eight dual-processor 2.6 GHz 32-bit Intel Xeons,
totaling 16 processors. The performance of the three algorithms is
evaluated in terms of speedup and efficiency.

Keywords: Parallel algorithm, successive over-relaxation (SOR)
iteration, Linux cluster, message passing interface (MPI).


1.  Introduction 

The successive over-relaxation (SOR) iterative method is an important
solver for linear systems. The SOR method is inherently sequential in
its original form. To take advantage of supercomputing resources with
multiple processors, several parallel versions of the SOR method have
been proposed. One of the widely used parallel versions is the
multi-color SOR method, which uses the multi-color ordering technique
[1]. For complicated problems, two or more colors are required to
define a multi-color ordering. Harrar II [2] and Melhem [3] studied
how to quickly verify and generate a multi-color ordering according to
the given structure of a matrix or a grid. However, the multi-color
SOR method is parallel only within the same color. For some problems,
such as two-dimensional (2D) heat transfer described by Poisson
equations, the Red-Black two-color SOR method is preferred. Yanheh [4]
showed that the Red-Black SOR method is more efficient and smoother
than the sequential SOR method. Xie proposed an efficient parallel SOR
method (PSOR) using domain decomposition and interprocessor data
communication techniques [5]. It is shown that PSOR is just the SOR
method applied to a reordered linear system, so that the theory of SOR
can also be applied to the analysis of PSOR. Other techniques, such as
pipelining of computation and communication and an optimal schedule of
a feasible number of processors, have been studied and applied to
define parallel versions of SOR for banded or dense matrix problems
[6]; they can be implemented in parallel without changing the
sequential SOR method. Parallel SOR methods for particular parallel
computers can also be found in [7].

Most of these early studies on parallel SOR methods focused on their
mathematical properties and were designed for MIMD machines. Due to
the high cost of supercomputers, Linux clusters have drawn more and
more attention in high-performance computing and have been used for
solving a variety of computational problems in science and
engineering. The performance of SOR methods on distributed memory
platforms, e.g., Linux clusters, may be quite different from that on
shared memory supercomputers, such as SGI Onyx 10000. The actual
performance of these algorithms depends on the hardware, the
interconnection of processors and the implementation. To design an
efficient parallel SOR method, task decomposition and task dependency
must be investigated to achieve the maximum degree of concurrency and
to minimize the communication cost in parallel computation. In this
paper, a parallel version of the SOR method is designed based on the
widely used Red-Black SOR (RB-SOR) and row block domain decomposition.
The parallel SOR method is implemented in C with MPI and tested on a
Linux cluster. The interprocess communication and task dependency are
analyzed. Performance results such as convergence rate, speedup and
efficiency are given and compared with those of other methods, namely
the Jacobi method and the Gauss-Seidel (G-S) method.


2.  Serial  and  Parallel  SOR  Methods 

The SOR method is a widely used iterative procedure to solve a linear
system or a partial differential equation discretized by a finite
difference method. Consider the Dirichlet problem

$$ \begin{cases} u_{xx} + u_{yy} = f(x,y), & (x,y) \in \Omega \\ u(x,y) = g(x,y), & (x,y) \in \partial\Omega \end{cases} \qquad (1) $$

with $\Omega = (0,1) \times (0,1)$. Assuming a uniform partition, the
intervals $I_x = I_y = [0,1]$ are divided into $n_s$ subintervals
$I_x^m = I_y^m$, $m = 1, \ldots, n_s$. This generates a uniform grid
with spacing $h = 1/n_s$ and nodal coordinates $(x_i, y_j)$, where
$x_i = (i-1)h$ and $y_j = (j-1)h$, $i, j = 1, \ldots, (n_s + 1)$. The
finite difference method (FDM) is used to discretize Eq. (1). We
consider the 5-point approximation to Eq. (1) on a unit square
$\Omega_h$ and obtain the following finite difference scheme at node
$(i,j)$

$$ u_{i,j-1} + u_{i,j+1} + u_{i-1,j} + u_{i+1,j} - 4u_{i,j} = h^2 f_{i,j} \quad \text{in } \Omega_h \qquad (2) $$

where $f_{i,j}$ is the value of the function $f(x,y)$ at the node
$(i,j)$ and $u_{i,j}$ denotes the approximation of $u(x_i, y_j)$.


2.1  Serial  iterative  methods 

Eq. (2) can be rewritten in matrix form $Au = f$, where $A$ is a
matrix and both $u$ and $f$ are vectors. The solution can be obtained
using iterative methods such as the Jacobi method, the G-S method and
the SOR method. After making an initial guess of $u$, e.g., $u^{(0)}$,
the Jacobi iterative method generates a sequence of approximations
$u^{(k)}$, $k = 1, 2, 3, \ldots$, to the solution. The approximation
$u^{(k+1)}$ at the $(k+1)$th iteration is computed using the results
obtained in the $k$th iteration:

$$ u_{i,j}^{(k+1)} = \frac{1}{4}\left( u_{i-1,j}^{(k)} + u_{i+1,j}^{(k)} + u_{i,j-1}^{(k)} + u_{i,j+1}^{(k)} - h^2 f_{i,j} \right) \qquad (3) $$

where $u_{i,j}^{(k+1)}$ at node $(i,j)$ of the $(k+1)$th iteration
relies on the values of the four neighboring nodes $(i-1,j)$,
$(i+1,j)$, $(i,j-1)$ and $(i,j+1)$ obtained at the previous $k$th
iteration.
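
The Jacobi update can be sketched in C as follows. This is a minimal serial illustration, not the implementation used in the paper; it assumes an n-by-n grid stored row-major with 0-based indices, and the function name `jacobi_sweep` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One Jacobi sweep of the 5-point scheme on an n-by-n grid (row-major).
 * u_old holds u^(k); u_new receives u^(k+1) at the interior nodes.
 * Boundary entries of u_new are left untouched (Dirichlet values).
 * f holds f_ij at every node and h is the grid spacing. */
void jacobi_sweep(size_t n, const double *u_old, double *u_new,
                  const double *f, double h)
{
    assert(n >= 3); /* at least one interior node */
    for (size_t i = 1; i + 1 < n; ++i) {
        for (size_t j = 1; j + 1 < n; ++j) {
            size_t c = i * n + j; /* index of node (i, j) */
            /* Only values from the previous iteration are read. */
            u_new[c] = 0.25 * (u_old[c - n] + u_old[c + n] +
                               u_old[c - 1] + u_old[c + 1] -
                               h * h * f[c]);
        }
    }
}
```

Because the sweep reads only `u_old`, every interior node can be updated independently, which is why the method parallelizes so easily.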

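The Jacobi iterative method is easy to implement in parallel, but its
convergence rate is very low. It is seen that if we update
$u_{i,j}^{(k+1)}$ according to increasing values of the subscripts $i$
and $j$, the most current values $u_{i-1,j}^{(k+1)}$ and
$u_{i,j-1}^{(k+1)}$ are already available when we compute the new
update $u_{i,j}^{(k+1)}$. This suggests making use of the most recent
values at nodes $(i-1,j)$ and $(i,j-1)$ to update $u_{i,j}^{(k+1)}$,
which results in the G-S iterative method

$$ u_{i,j}^{(k+1)} = \frac{1}{4}\left( u_{i-1,j}^{(k+1)} + u_{i+1,j}^{(k)} + u_{i,j-1}^{(k+1)} + u_{i,j+1}^{(k)} - h^2 f_{i,j} \right) \qquad (4) $$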


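The convergence rate of the G-S iterative method can be further
improved by applying SOR iteration. For any $\omega \neq 0$, Eq. (4)
can be rewritten as

$$ u_{i,j}^{(k+1)} = (1-\omega)\, u_{i,j}^{(k)} + \frac{\omega}{4}\left( u_{i-1,j}^{(k+1)} + u_{i+1,j}^{(k)} + u_{i,j-1}^{(k+1)} + u_{i,j+1}^{(k)} - h^2 f_{i,j} \right) \qquad (5) $$

in which the most recently computed values $u_{i-1,j}^{(k+1)}$ and
$u_{i,j-1}^{(k+1)}$ are used as soon as they are available. The
optimal value of $\omega$ lies in (0, 2). The choice of $\omega = 1$
corresponds to the Gauss-Seidel iteration.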
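
A serial SOR sweep in the natural (row-by-row) ordering might look like the sketch below. It assumes the same row-major, 0-based n-by-n layout as an ordinary C array; the name `sor_sweep` and its parameters are hypothetical. The update is done in place, so nodes visited earlier in the sweep already hold their new values.

```c
#include <assert.h>
#include <stddef.h>

/* One in-place SOR sweep in row-by-row order on an n-by-n grid.
 * With omega == 1.0 this is exactly one Gauss-Seidel sweep. */
void sor_sweep(size_t n, double *u, const double *f, double h, double omega)
{
    assert(n >= 3);
    assert(omega > 0.0 && omega < 2.0);
    for (size_t i = 1; i + 1 < n; ++i) {
        for (size_t j = 1; j + 1 < n; ++j) {
            size_t c = i * n + j;
            /* u[c - n] and u[c - 1] were already updated in this sweep;
             * u[c + n] and u[c + 1] still hold the previous iterate. */
            double gs = 0.25 * (u[c - n] + u[c + n] +
                                u[c - 1] + u[c + 1] - h * h * f[c]);
            u[c] = (1.0 - omega) * u[c] + omega * gs;
        }
    }
}
```

The dependence on `u[c - n]` and `u[c - 1]` is what makes this ordering inherently sequential.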

2.2  Red-Black  SOR  method 

Eq. (5) can be implemented in parallel using the red-black ordering
technique. The node $(i,j)$ is denoted red or black according to
whether $i + j$ is odd or even. If $i + j$ is odd, the node $(i,j)$ is
marked red, and if $i + j$ is even, the node $(i,j)$ is marked black.
The red-black ordering is illustrated in Fig. 1. The evaluation of
each $u_{i,j}^{(k+1)}$ corresponding to a red node involves the values
of black nodes only, and vice versa.

Fig. 1 Red-black ordering technique

Based on the red-black ordering technique, the approximation
$u_{i,j}^{(k+1)}$ can be updated in an order different from that of
Eq. (5). Each iteration of this method consists of two phases: (1)
updating all the red values first and then (2) updating all the black
values. The two phases are illustrated by Fig. 2(a) and Fig. 2(b),
respectively. For red nodes where $i + j$ is odd, we have

$$ u_{i,j}^{(k+1)} = (1-\omega)\, u_{i,j}^{(k)} + \frac{\omega}{4}\left( u_{i-1,j}^{(k)} + u_{i+1,j}^{(k)} + u_{i,j-1}^{(k)} + u_{i,j+1}^{(k)} - h^2 f_{i,j} \right) \qquad (6) $$

For black nodes where $i + j$ is even,

$$ u_{i,j}^{(k+1)} = (1-\omega)\, u_{i,j}^{(k)} + \frac{\omega}{4}\left( u_{i-1,j}^{(k+1)} + u_{i+1,j}^{(k+1)} + u_{i,j-1}^{(k+1)} + u_{i,j+1}^{(k+1)} - h^2 f_{i,j} \right) \qquad (7) $$

Fig. 2. Two phases of red-black implementation of the SOR algorithm.
(a) Phase 1: updating the values of the red nodes. (b) Phase 2:
updating the values of the black nodes.

In Phase 1, the computation of all red nodes depends on the values of
all black nodes, which are already available from Phase 2 of the
previous iteration. Thus, the computation can be partitioned into a
number of independent tasks and performed by multiple processors. In
Phase 2, the computation of all black nodes depends on the values of
all red nodes, which have been computed in Phase 1. Similarly, it can
also be partitioned into a number of independent tasks and performed
by multiple processors, as described in Section 2.3. Multicolor SOR
methods can be derived in a similar way.
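
The two phases can be sketched serially in C as follows. This is an illustrative sketch under stated assumptions rather than the paper's code: the grid is n-by-n, row-major and 0-based (the parity of i + j is the same as with the 1-based indices used in the text), and the names `sor_color_sweep` and `rb_sor_iteration` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Update every interior node whose (i + j) % 2 equals `parity`.
 * All four neighbours of such a node have the opposite colour,
 * so the updates within one call are mutually independent. */
static void sor_color_sweep(size_t n, double *u, const double *f,
                            double h, double omega, size_t parity)
{
    for (size_t i = 1; i + 1 < n; ++i) {
        /* first interior column with the requested parity */
        size_t j0 = ((i + 1) % 2 == parity) ? 1 : 2;
        for (size_t j = j0; j + 1 < n; j += 2) {
            size_t c = i * n + j;
            double gs = 0.25 * (u[c - n] + u[c + n] +
                                u[c - 1] + u[c + 1] - h * h * f[c]);
            u[c] = (1.0 - omega) * u[c] + omega * gs;
        }
    }
}

/* One red-black SOR iteration: red nodes first, then black nodes. */
void rb_sor_iteration(size_t n, double *u, const double *f,
                      double h, double omega)
{
    assert(n >= 3);
    sor_color_sweep(n, u, f, h, omega, 1); /* Phase 1: red, i + j odd   */
    sor_color_sweep(n, u, f, h, omega, 0); /* Phase 2: black, i + j even */
}
```

Within each call to `sor_color_sweep` the loop iterations do not depend on one another, so each phase can be split across processes in the same way as a Jacobi sweep.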

2.3  Domain  decomposition 

Since the SOR method is applied to all nodes in the computational
domain, an intuitive way to implement it in parallel is to divide the
nodes into a number of subsets and let each process perform the
computation on one subset of nodes. This approach results in a domain
decomposition. A rectangular computational domain can be partitioned
into a number of subdomains using either a row block partition or a
checkerboard partition. In the checkerboard partition, each process
needs to communicate with several (up to four) adjacent processes,
which leads to a higher communication cost. In the row block
partition, each process communicates with at most two processes and
the communication pattern is simple, as we will see in the next
section. Thus, a row block domain decomposition method is employed in
this study. Without loss of generality, the $n$ rows of the mesh are
divided evenly into $p$ consecutive blocks, where $p$ is the number of
processes used. The entire computation is partitioned into $p$ tasks,
each of which is assigned to one process for execution. There are a
total of $mn/p$ nodes in each subdomain, where $m$ is the number of
nodes in each row. Two adjacent processes must exchange the data of
their local boundary nodes.
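
The row block partition can be illustrated with a small C helper that computes which rows a given process owns. This is a sketch, not the paper's code: the names `RowBlock`, `row_block` and `neighbour_count` are hypothetical, and spreading any remainder rows over the first processes is an assumption added here for the case where p does not divide n (the text assumes an even division).

```c
#include <assert.h>

/* Rows [first, first + count) of the mesh owned by one process. */
typedef struct {
    int first;
    int count;
} RowBlock;

/* Row block of process `rank` when n rows are split over p processes.
 * If p does not divide n, the first n % p processes get one extra row. */
RowBlock row_block(int n, int p, int rank)
{
    assert(p > 0 && rank >= 0 && rank < p && n >= p);
    int base = n / p;
    int extra = n % p;
    RowBlock b;
    b.count = base + (rank < extra ? 1 : 0);
    b.first = rank * base + (rank < extra ? rank : extra);
    return b;
}

/* Number of adjacent processes a rank exchanges boundary rows with.
 * In a row block partition this is never more than two. */
int neighbour_count(int p, int rank)
{
    assert(p > 0 && rank >= 0 && rank < p);
    return (rank > 0) + (rank < p - 1);
}
```

For example, with n = 100 rows and p = 4 processes, process 1 owns rows 25 to 49 and exchanges boundary rows with processes 0 and 2.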

2.4  Interprocess  communication 

To design an efficient parallel algorithm, we need to reduce both the
computation cost of each task and the interprocess communication cost
to achieve high performance. At the end of each phase of the red-black
SOR algorithm, all processes are synchronized. Based on the row block
domain decomposition, rows are distributed so that each processor gets
an equal number of rows in order to balance the computational load.
Each process only needs to communicate with its adjacent processes and
send them the most recent values on the internal boundary, as shown in
Fig. 3. At the end of Phase 1, the process $P_q$ sends the value of
the red node $(i, j)$ to the adjacent process, and the adjacent
process sends the values at red nodes $(i-1, j-1)$ and $(i+1, j-1)$ to
the nodes $(i-1, j)$ and $(i+1, j)$ on process $P_q$.

At the end of Phase 2, the communication is quite similar except that
the direction of communication is reversed.

Internal boundary rows are communicated using a non-blocking MPI_Isend
and a blocking MPI_Recv. This ensures that deadlocks do not occur.
Additionally, the root process $P_0$ containing the top rows only
needs to communicate with the process $P_1$. The last process
$P_{p-1}$ containing the last group of rows only needs to communicate
with the process $P_{p-2}$.

Additionally, convergence is calculated once per iteration using
MPI_Reduce, and its communication cost is $t_w \log p$. Ignoring the
per-hop time and start-up time, the total communication cost in Phases
1 and 2 of the red-black SOR parallel algorithm for each iteration is
given as follows

$$ C = C_{phase1} + C_{phase2} + C_{reduce} = t_w m (p-1) + t_w m (p-1) + t_w \log p \qquad (8) $$

where $p$ is the total number of processes, $m$ is the number of nodes
on the inner boundary, which is also equal to the number of nodes in
the x direction, $t_w$ is the transfer time per word, and $C$ is the
total communication cost, which is the sum of the send-receive
communication costs in Phases 1 and 2 and the communication cost of
the MPI_Reduce operation at the end of each iteration.

Fig. 3 Communications between two adjacent processes (Phase 1)
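
As a rough illustration of the communication cost in Eq. (8), the worked example below evaluates it for a mesh with m = 100 nodes per row. The base-2 logarithm is an assumption made here (the text writes only log p), and the values of p are chosen purely for illustration.

```latex
\begin{align*}
C &= t_w m (p-1) + t_w m (p-1) + t_w \log_2 p \\
p = 4:\quad  C &= t_w \,(100 \cdot 3 + 100 \cdot 3 + 2) = 602\, t_w \\
p = 16:\quad C &= t_w \,(100 \cdot 15 + 100 \cdot 15 + 4) = 3004\, t_w
\end{align*}
```

Under these assumptions the send-receive terms dominate and grow linearly with p, while the reduction term is negligible.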

2.5  Parallel  SOR  algorithms 

Based  on  the  above  design  and  analysis,  the 
algorithm  of  parallel  SOR  with  domain 
decomposition  is  developed,  as  shown  in 
Algorithm  1. 


procedure Parallel_SOR(My_ID, omega, U_Curr, U_Next)
begin
  while (!isConvergent && iteration < max_iter) {
    /* update U_Curr, which holds the solution from the
       previous iteration */
    U_Curr = U_Next;
    if (My_ID != ROOT)
      { above_row = get_above(U_Curr); }
    if (My_ID != LAST)
      { below_row = get_below(U_Curr); }

    /* Phase 1: solve for red values using black
       values from the previous iteration */
    solve_red(U_Curr, U_Next, above_row, below_row, omega);

    /* Phase 2: solve for black values using red
       values from this iteration */
    solve_black(U_Next, U_Next, above_row, below_row, omega);

    /* determine whether the iteration has converged */
    error_local = Compute_Error();
    if (My_ID == ROOT) {
      isConvergent = convergence_control(error_local);
    }
    iteration++;
  } // end while

  gather partial result U_Next from all processes
end Parallel_SOR


Algorithm 1 Parallel SOR algorithm with domain decomposition for a
distributed memory platform (Linux cluster), in which My_ID is the
rank of each process, U_Curr represents $u^{(k)}$, U_Next represents
$u^{(k+1)}$, and above_row and below_row are the two boundary rows in
the subdomain used for communication.

In this algorithm, the initial global data are scattered onto multiple
processors and each processor only stores its local data. At the end
of Phase 1, synchronization is enforced since the most recent values
of all red nodes will be used for the computation of black nodes in
Phase 2. Fig. 2 shows that the update of the values of all red nodes
can be performed in parallel because the values of all black nodes are
available and the tasks corresponding to different subdomains are
independent, which simplifies the parallel implementation. The same
procedure is applied for updating the values of all black nodes in
Phase 2. It is seen that for each phase of the SOR algorithm, the
parallel computation is quite similar to the Jacobi method.
Synchronization is also needed at the end of Phase 2 for the same
reason it is used at the end of Phase 1. Since the parallel SOR is an
iterative algorithm, the parallel computation is synchronized at the
end of each iteration. The ROOT process determines whether the
computation has converged. If it has not, the process continues with
the next iteration. Convergence is tested using the following
criterion

$$ \left\| u^{(k+1)} - u^{(k)} \right\|_2 < \varepsilon \qquad (9) $$

where $u^{(k+1)}$ is the nodal solution for the latest iteration,
$u^{(k)}$ is the nodal solution for the immediately preceding
iteration, and $\varepsilon$ is the convergence tolerance. The ROOT
process sums up the squared differences between the current and new
values at each node, and takes the square root of the total sum to
test convergence.
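
The convergence test of Eq. (9) can be sketched in C as below. The function names `local_sq_diff` and `is_converged` are hypothetical. The partial sums are passed in as a plain array here; in the MPI program they would presumably be combined by the MPI_Reduce call mentioned in Section 2.4. The sketch compares the summed squares with the squared tolerance, which is equivalent to taking the square root for a positive tolerance and avoids a call to sqrt.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of squared differences over the nodes owned by one process. */
double local_sq_diff(const double *u_new, const double *u_old, size_t count)
{
    double s = 0.0;
    for (size_t i = 0; i < count; ++i) {
        double d = u_new[i] - u_old[i];
        s += d * d;
    }
    return s;
}

/* Root-side test of ||u_new - u_old||_2 < eps, given one partial sum
 * per process. Evaluated as (total of squares) < eps * eps. */
int is_converged(const double *partial_sums, size_t p, double eps)
{
    assert(eps > 0.0);
    double total = 0.0;
    for (size_t i = 0; i < p; ++i) {
        total += partial_sums[i];
    }
    return total < eps * eps;
}
```

Each process only needs to contribute a single number per iteration, which keeps the cost of the convergence check small compared with the boundary exchange.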

3.  Implementation 

Many  programming  languages  and  libraries 
have  been  developed  for  explicit  parallel 
programming.  They  differ  in  the  view  of  the 
address  space,  degree  of  synchronization  and 
multiplicity  of  programs.  The  message  passing 
interface  (MPI)  is  a  standard  for  writing  message 
passing  programs.  MPI  was  originally  targeted 
for  distributed  memory  systems,  such  as  a  cluster 
of  PCs,  but  it  is  supported  on  virtually  all  high 
performance  computing  (HPC)  platforms, 
including  shared  memory  platforms,  e.g.,  SGI 
Origin.  In  this  paper,  MPI  is  chosen  for 
developing  parallel  computing  applications 
because  of  its  standardization,  portability, 
performance,  functionality  and  availability.  The 
subroutines  and  functions  in  MPI  are  called  from 
C  code.  The  serial  algorithm  is  implemented  first, 
and  then  modified  to  support  parallel  computing. 
Both  serial  and  parallel  implementations  produce 
the  same  convergent  results. 

For most iterative scientific computing applications, when a data
decomposition technique is employed, all processes execute the same
code but operate on different data sets. This is the single program,
multiple data (SPMD) paradigm used for implementing parallel
algorithms. Parallelism in MPI is explicit, so developers need to
consider how different processes perform their tasks and communicate
with each other in the parallel algorithm design.


To validate the parallel SOR algorithm and evaluate its performance on
a Linux cluster, the conventional parallel Jacobi method is also
implemented using C and MPI. The Jacobi method is described by Eq.
(3), in which the new value $u_{i,j}^{(k+1)}$ of any node at the
current iteration depends only on the old values $u_{i-1,j}^{(k)}$,
$u_{i+1,j}^{(k)}$, $u_{i,j-1}^{(k)}$ and $u_{i,j+1}^{(k)}$ of the four
neighboring nodes at the previous iteration. After a row block domain
decomposition is applied to the entire computational domain, the task
corresponding to each subdomain is to update the values of all inner
nodes in this subdomain. Obviously, these tasks are independent and
can be easily implemented in parallel. If the parameter $\omega$ in
the parallel SOR algorithm is equal to 1, i.e., $\omega = 1$, the
parallel SOR reduces to the parallel Gauss-Seidel algorithm. The
parallel G-S is implemented in the same way as the parallel SOR except
for the value of the parameter $\omega$.

All performance tests were conducted on eight dual-processor 2.6 GHz
32-bit Intel Xeons, totaling 16 processors. The computers of this
parallel computing platform were connected by a 100 Mbps switched
Ethernet bus network (a local area network).

Fig. 4. Performance comparisons of the Jacobi, G-S ($\omega = 1.0$),
and SOR ($\omega = 1.9$) iterative methods. (a) Number of iterations
with respect to different methods. (b) Serial time with respect to
different methods.


4.  Performance  Results 

The SOR, G-S and Jacobi iterative methods are implemented both in
serial and in parallel. Globalized output files were validated for
both serial and parallel implementations of each method, and were
found to be identical for all three methods on the domain with a
50 x 50 mesh and a 100 x 100 mesh. All three methods agree to the
thousandths place. Fig. 4 shows the comparison of the number of
iterations needed to converge and the total time each method takes to
converge when run in serial. In the SOR method, $\omega = 1.9$, and in
the G-S method, $\omega = 1.0$. Among the three iterative methods, SOR
is clearly the fastest in terms of both serial time and number of
iterations.


Performance is evaluated in terms of speedup and efficiency. The
speedup $S$ is defined as $S = T_s / T_p$ and the efficiency $E$ is
defined as $E = S / P$, where $T_s$ and $T_p$ are the serial execution
time and the parallel execution time, respectively, and $P$ is the
number of processors. The performance results for the 50 x 50 and
100 x 100 meshes and the three methods are given in Fig. 5 and Fig. 6.
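
To make the two definitions concrete, the following worked example uses hypothetical timings, not measurements from this study: a serial time of 12 s and a parallel time of 4 s on P = 4 processors.

```latex
\begin{align*}
S &= \frac{T_s}{T_p} = \frac{12\ \text{s}}{4\ \text{s}} = 3, &
E &= \frac{S}{P} = \frac{3}{4} = 0.75
\end{align*}
```

An ideal speedup would be S = P = 4, so an efficiency below 1 measures the fraction of processor time lost to communication and synchronization.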

From Fig. 5(a) and 6(a), the speedup increases as the number of
processors increases. The actual speedup is smaller than the ideal
speedup because the communication cost is relatively higher when the
algorithm is executed on a Linux cluster than when it is executed on a
shared memory platform. From Fig. 5(b) and 6(b), it is seen that when
more processors are used for parallel computation, the communication
cost increases, as given by Eq. (8), and the efficiency decreases.


Fig. 5. Speedup and efficiency with respect to three methods for the
50 x 50 mesh

Fig. 6. Speedup and efficiency with respect to three methods for the
100 x 100 mesh


5. Conclusions

This paper designs a parallel algorithm for the red-black SOR
iterative method with domain decomposition and compares it with the
parallel Jacobi and G-S methods. All three parallel iterative methods
are implemented in C and MPI and executed on a Linux cluster with
eight dual-processor 2.6 GHz 32-bit Intel Xeons, totaling 16
processors. The computers are connected by a 100 Mbps switched
Ethernet bus network. The performance results show that, of the three
iterative methods, SOR converges fastest with a properly chosen
parameter $\omega$, e.g., $\omega = 1.9$. The speedup of the three
methods increases but the efficiency decreases when the number of
processors increases. In addition, the speedup and efficiency plots
are quite similar for the 50 x 50 and 100 x 100 meshes. The
communication cost increases with the number of processors, so the
speedup of these methods is smaller when executed on a Linux cluster
than when executed on an SGI shared memory platform.